-
Notifications
You must be signed in to change notification settings - Fork 120
Add RCU read bindings (#143) and each_process() iterator #250
base: master
Are you sure you want to change the base?
Conversation
Obviously the test case needs work but I'd like some feedback on this design. I'm liking it, but at this point I've made myself familiar with how RCU works :) |
tests/for-each-process/src/lib.rs
Outdated
) -> linux_kernel_module::KernelResult<()> { | ||
let g = rcu::RcuReadGuard::new(); | ||
for mut p in sched::each_process(&g) { | ||
buf.write(&p.comm())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing to a user pointer can sleep, right, so this is incorrect, since you may sleep during an RCU read-side critical section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you're right. How would you do this in C? Make a local variable, fill it in, and then copy_to_user after rcu_read_unlock?
I can do that, I'm not sure how to get Rust to catch it, and I'm curious if Sparse would have caught it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is tricky, since in the kernel we usually want to avoid dynamically-sized allocation if we can (especially with lengths under user control), so copying the entire output into memory is frowned upon, but also, once we rcu_read_unlock
, we can't guarantee our struct task_struct *
linked list doesn't change out from under us.
In practice, something like this would probably be built on the seq_file
abstraction. seq_file
users have a "position" abstraction that they must use to track progress through their underlying data source, and in practice, the seq_file
infrastructure implements a single-page bounce buffer that callers write into (one page at a time), and then seq_file
copies that back to userspace.
But wait, you might ask, how do you use that position abstraction to keep your position in the task_struct
linked list across each page of data? Do you have to get_task_struct
the current finger into the list?
Well, no, and in fact that would be illegal, since it would allow userspace to construct a long-lived reference to a task by open
ing your file, advancing through it with small reads until it got to the point you wanted, and then just sitting on the fd
. By sending that fd
to another process you could construct a circular list of task references, which is no go.
So instead, you would just keep your cursor as an integer "index into the task list", and, each time you were called upon to return another page of data, you would start with init_task
and advance N
tasks forward and resume printing.
This is, you may note, both (a) quadratic, and (b) racy.
And yes, this is how almost every file in /proc
works; If you read (e.g.) /proc/mounts
and it takes more than a single read()
call to return the full contents, you can miss mount entries, even ones that existed continually for the entire open/read/read/close
cycle, if there are concurrent mounts and unmounts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to printk, which one hopes is safe from RCU context. (Our println!
is heap-free.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, printk
is safe in any context, even really hostile ones like NMIs or stop_machine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reference on RCUs to get up to speed?
@nelhage's comment leaves me a bit concerned that it's incredibly easy to accidentally violate the rules leading to deadlocks. Is that just life in C as well?
src/sched.rs
Outdated
/// typically the base name of the command, and does not have the | ||
/// full path or arguments. It's a fixed-sized set of bytes, but by | ||
/// convention it's interpreted as NUL-terminated. | ||
pub fn comm(&mut self) -> Vec<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No practical way to return an &[u8]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.0.comm
is only valid while we've got a lock on self.0
, and I don't want to hold a lock for the lifetime of this structure.
We could probably expose the locking to Rust, i.e, something like
impl TaskStruct {
fn lock(&mut self) -> LockedTaskStruct;
}
impl LockedTaskStruct {
fn comm(&mut self) -> &[u8];
}
or perhaps even something like
impl TaskStruct {
fn comm(&mut self, g: &mut LockGuard) -> &[u8];
}
though given the lack of NLL, I'm more hesitant about using an RAII guard to lock a task_struct than I am about making an RCU read-side critical section. An RcuReadGuard only prevents cleanup after making changes, it doesn't prevent a writer from doing the reading, copying, or updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to return a [u8; TASK_COMM_LEN]
? This is basically what get_task_comm
does in C.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that would work. I was actually leaning towards taking an &mut [u8]
but I guess yeah you can just return the array and make the compiler deal with it.
(This code was largely just "make anything work at all")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched it to returning a [u8; TASK_COMM_LEN]
.
It seems that many of my task structs have some garbage after the null byte. For instance, if in Python you open("/proc/self/comm").write("foo")
, you'll find that your comm is now foo\0on\0\0\0...
because it was previously python
. I could return this directly, but because we support memory safety here on Team Rust, and because I had to do shenanigans to convert i8 -> u8 anyway, I opted to cut it off after the first NUL byte.
src/sched.rs
Outdated
unsafe { | ||
task_lock_helper(self.0); | ||
// if only char were unsigned char | ||
let v = (&*(&(*self.0).comm[..] as *const _ as *const [u8])).to_vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waaaaa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets you an [i8; 16]
(i.e. char[16]
) and this seems to be the preferred non-transmute
way to turn that into a [u8]
. We should probably make a wrapper for this, this is sort of the job of OsStr
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is less shenanigous now. (It's more verbose but I hope the optimizer knows what it's doing.)
There's an overview and some links in #143. I'm about 80% sure that "crossbeam, but you guarantee epochs end because the kernel has precise control over scheduling" is a good tl;dr. The basic idea is it's an extremely reader-optimized reader-writer lock pattern. A "writer" can't modify an RCU-protected structure directly. Instead, it makes a copy of the thing it wants to modify, updates that, atomically updates a pointer, waits for all readers of the old thing to be done, and then frees the old thing. If you know that readers exit quickly, then you can do this with almost no overhead on the reader side and without having an actual lock in memory to point to. So it's mostly useful for protecting linked lists and other structures, like the list of processes in this example. It does not protect data that isn't behind a pointer, which means A) it's not a tool for multiple writers to coordinate (you want a real lock for that) and B) as in this example, actually accessing data inside a specific item in the linked list in a race-free way usually requires taking a real lock. |
This is correct. The kernel rules about (a) when it's safe to |
I think this is an equivalent problem to only calling async-signal-safe functions from a signal handler, or only calling |
https://github.com/japaric/cargo-call-stack might be some inspiration for "you can't call function Y from function X". It seems to work by parsing LLVM IR (it has its own parser) and constructing a call graph. |
I think this is ready to land - any concerns about the current API? |
Only that I need to read all the RCU docs and think hard :-)
…On Fri, Aug 21, 2020 at 10:33 AM Geoffrey Thomas ***@***.***> wrote:
I think this is ready to land - any concerns about the current API?
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#250 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAAGBBCLNQJJZZ3LCUOH4LSB2AU7ANCNFSM4QBHQJIA>
.
--
All that is necessary for evil to succeed is for good people to do nothing.
|
src/sched.rs
Outdated
Some(lvalue) => { | ||
*lvalue = *byte as _; | ||
} | ||
None => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this even reachable? Can you zip .comm
and result
together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Hopefully not?
- Probably? I mean really I want to just call copy_from_slice but there's no way to do that safely while casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this code gets much simpler if we're using ArrayVec :-)
src/sched.rs
Outdated
_g: &'g rcu::RcuReadGuard, | ||
} | ||
|
||
pub fn each_process(g: &rcu::RcuReadGuard) -> EachProcess { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this return impl Iterator<Item=TaskStruct>>
? no reason to have this type in the public API, right?
let g = rcu::RcuReadGuard::new(); | ||
for mut p in sched::each_process(&g) { | ||
let comm = p.comm(); | ||
let comm_until_nul = comm.split(|c| *c == 0).next().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be a null if the comm name uses up the full buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the buffer as returned by the kernel always includes at least one trailing NUL, but more importantly, split() always returns at least one element so we're fine either way.
/// typically the base name of the command, and does not have the | ||
/// full path or arguments. It's a fixed-sized set of bytes, but by | ||
/// convention it's interpreted as NUL-terminated. | ||
pub fn comm(&mut self) -> [u8; bindings::TASK_COMM_LEN as usize] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On thinking more... do we need a type that's backed by a fixed size array, but maintains a "how much of this is valid" usize
and derefs to slice/mut slice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. smallvec is basically this except that it falls back to Vec; I think we want one that just doesn't fall back.
While we're at it I also want to bury the i8/u8 nonsense under an abstraction too (i.e. impl Deref<Target=[u8]>
for something constructed from an &[i8]
). Really what I think I want is OsStr
from libstd :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an RcuReadGuard object for calling functions that expect to be RCU-protected, and add a Rust equivalent of for_each_process and a very light task_struct bindings sufficient to write a simple test/demo.
// - next_task calls rcu_dereference internally, which is safe | ||
// because we hold self._g. | ||
// - The returned reference has lifetime 'g, which is valid | ||
// because self._g lives at least that long. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that holding RCU doesn't guarantee the task list to be unchanged, RCU read-side critical section here only guarantees that the task_struct we reference here won't get freed (in side the RCU read-side critical section). We use another lock "tasklist_lock" to protect the tasklist.
Add an RcuReadGuard object for calling functions that expect to be
RCU-protected, and add a Rust equivalent of for_each_process and very
light task_struct bindings sufficient to write a simple test/demo.